Constructing of a Large-Scale Chinese-English Parallel Corpus

نویسندگان

  • Le Sun
  • Song Xue
  • Weimin Qu
  • Xiaofeng Wang
  • Yufang Sun
چکیده

This paper describes the constructing of a large-scale (above 500,000 pair sentences) Chinese-English parallel corpus. The current status of Chinese corpora is overviewed with the emphasis on parallel corpus. The XML coding principles for Chinese–English parallel corpus are discussed. The sentence alignment algorithm used in this project is described with a computer-aided checking processing. Finally, we show the design of the concordance of the parallel corpus and the prospect to further development.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Chinese-English Parallel Corpus Construction and its Application

Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, Chinese-English language research and teaching. But so far large-scale Chinese-English corpus is still unavailable yet, given the difficulties and the intensive labours required. In this paper, our work towards building a large-scale Chinese-Engli...

متن کامل

Mining Large-scale Parallel Corpora from Multilingual Patents: An English-Chinese example and its application to SMT

In this paper, we demonstrate how to mine large-scale parallel corpora with multilingual patents, which have not been thoroughly explored before. We show how a large-scale English-Chinese parallel corpus containing over 14 million sentence pairs with only 1-5% wrong can be mined from a large amount of English-Chinese bilingual patents. To our knowledge, this is the largest single parallel corpu...

متن کامل

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

Parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora to Chinese are subject to in-house use, while others are domain specific and limited in size. To a certain degree, this limits the SMT research. This paper describes the ...

متن کامل

Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applying the Uplug word alignment tool on the corpus and finally produce and evaluate an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually trans...

متن کامل

Large - Scale Automatic Extraction of anEnglish - Chinese Translation

We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the rst empirical results of the kind between an Indo-Europeanand non-Indo-Europeanlanguage for any signiicantvocabulary and corpus size. The learned vocabulary size is abo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002